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Abstract: Small angle X-ray scattering (SAXS) is used for low resolution structural characterization of proteins often in 
combination with other experimental techniques* After briefly reviewing the theory of SAXS we discuss computational methods 
based on 1) the Debye equation and 2) Spherical Harmonics to compute intensity profiles from a particular macromolecular 
structure. Further, we review how these formulas are parameterized for solvent density and hydration shell adjustment. Finally we 
introduce our solution to compute SAXS profiles utilizing GPU acceleration. 

Mini Review Article 



Introduction 



Small angle X-ray scattering (SAXS) is an experimental structural 
characterization method for rapid analysis of biological 
macromolecules in solution [1-6]. Because the samples do not need to 
be crystallized, they can be studied in different pH environments and 
concentrations leading to insightful structure-function relationships. 
The overall SAXS scattering profile is calculated by subtracting the 
scattering profile of the blank buffer solution from the profile of the 
sample dispersed in solution. SAXS data has been used to filter a set 
of protein models by comparing the SAXS profile of each model with 
the experimental SAXS profile [7,8]. The SAXS profile has been 
incorporated as a term in the scoring function to obtain a protein 
model consistent with the experimental SAXS data [9]. An exciting 
feature in modern SAXS is identifying and modeling protein 
flexibility from an ensemble set of different conformers to fit 
experimental SAXS data [10,11]. This requires a large library of 
starting conformers as input to the algorithm [12]. After a suitable 
library of conformers has been generated or found, the experimental 
SAXS data are used as a constraint in an algorithm to determine 
which combination of conformers optimally fit the data. The 
scattering intensity (I) is represented by a linear combination of the 
selected conformers. In this process the algorithm must decide 1) 
Which conformers to use and 2) How many conformers are required 
to accurately recreate the experimental SAXS profile. Critical to the 
success of this task are the underlying algorithms used to compute a 
SAXS profile from a proposed protein model. In this review we 
highlight different methods to accomplish this task. We recognize 
that these methods are not exhaustive of all methods, but represent a 
sampling of different approaches that provide insight to the process 
of computing SAXS profiles from atomic coordinates. For a more 
comprehensive review of small angle X-ray scattering theory we 
recommend several reviews [1-3,13]. 



X-ray Scattering Review 

X-ray scattering is observed when differences in electron density 
exist in a given sample and X-rays generated from a source device pass 
through the sample. Although both coherent and incoherent scattering 
is possible, we will confine our considerations to coherent scattering 
because incoherent scattering is negligibly weak at very small angles 
[1]. Elastic (without energy change) electron scattering is influenced 
by all atomic orbitals. Because atomic orbitals have different shapes 
according to their atomic group, the X-ray scattering provides 
information on the structure of the target sample. 

The scattering process occurs as electrons resonate with the 
frequency of the X-rays passing through the object. As the electrons 
resonate, they emit coherent secondary waves which undergo both 
constructive and destructive interference. Because of destructive 
interference, the superposition of waves with all possible phases will 
lead to zero scattering at a scattering angle of 29 [1]. The scattering 
maximum 1(0) will be theoretically observed at a scattering angle of 
zero where all waves are in phase. Because of the high intensity of the 
incident X-ray beam, a beam stop is placed between the detector and 
the beam to prevent it from distorting the scattering profile. 1(0) must 
therefore be computed rather than experimentally observed. 

To illustrate the scattering process, consider a linearly polarized 
monochromatic X-ray beam incident on a single electron with charge 
e and mass m. The periodic electric field of the incident X-ray 
produces a force on the electron (F = qeE) where F is the overall force 
the electron experiences, qe is the charge of the electron and E is the 
electric field of the incident X-ray. This force causes the electron to 
oscillate with the same frequency as itself. The equations governing 
this behavior are shown below beginning with the electric field 
equation: 

E = Eoe''^'^^-^^ (1) 
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where E is the electric field, Eo is the maximum value of the electric 
field, CO is the frequency of oscillation of the wave-field, t is time, and 
6 is the phase constant. 

By Newton's second law of motion we equate the two equations of 
force: 

F = ma = q^E = q^Eoe'^"^^'^^ (2) 
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Figure 1. SAXS Experimental Setup. X-rays with a constant wavelength X are first focused by the collimator and then pass through the purified sample in 
solution. A small fraction of the X-Rays scatter as they encounter electrons in the sample. The detector captures these scattered X-rays as intensity values. The 
final scattering profile is the difference between the profile of a blank buffer solution and a solution containing the purified sample. 



where m is the mass and a is the acceleration. The acceleration the 
electron experiences due to the periodic electric field is computed by 
dividing by the mass: 



m 

where the amplitude Ao is: 



(3) 



(4) 



The electromagnetic radiation at a given distance with magnitude r 
from the charge q that experiences acceleration a has an electric field 
component: 



£ = - 



(5) 



where c is the speed of light, r is the magnitude of the position vector, 
qe is the charge, a is the acceleration and a is the angle between a and 
r. If the position of r is perpendicular to the incident beam (which is 
true for SAXS experiments) then a = 90° and sina = 1. Combining 
this simplification with the electric field component and substituting 
Ao for a: 



£ = - 



Qe Qe 

'~ ^ 

c^r m 



mc^J r 



(6) 



Now imagine instead of a single electron, we have an electron 
cloud. As incident X-rays pass through an electron cloud with the 
origin at the center, most of them travel through the cloud without 
scattering, while a small fraction (< 1%) of the incident X-rays are 
scattered. This can be seen from the scattered to incident amplitude 
ratio: 



Eq XmC^J r r 



(7) 



where re is the constant Thomson scattering length and r is the 
distance from the object to the detector. 



n = ■ 



1 Qi 



= 2.818 X IQ-^^m 



(8) 



Because re is small, the scattered-to-incident amplitude ratio 
reveals that a single electron scatters a very small fraction of the 



incident X-rays. For example, at a sample to detector distance of three 
meters (typical for SAXS experiments), the amplitude ratio is: 



2.8l8xi0"^^m 



10" 



(9) 



Table 1. Numerical values of critical constants in Thompson 
Scattering. 



Name 


Value 


qe Electron charge 


1.602 xlO-'^C 


me Electron rest mass 


9.107x10'^^ kg 


c Speed of Light 


2.998x 10' m/s 


So Permittivity of free space 


8.854x 10-^' C'/N-m" 



For an fuller description of the physics of X-ray scattering and the 
mathematics of waves we refer to the notes of Dr. Robert 
Blessmg[14]. 

Because the scattered waves are coherent, the resulting amplitudes 
are added and the intensity is given by the absolute square of the 
amplitude [1]: 



A=Y.UAr,: I=\A' 



(10) 



where An is the resulting amplitudes of all scattered waves and I is the 
scattering intensity. In Thompson elastic scattering all secondary 
waves have the same intensity and is given by [1]: 



1 l+cos^2e 

2 7 



(11) 



where Ip is the primary intensity and L is the intensity of the 
secondary waves. The term e^/mc^ is the classical electron radius and, 
r is the distance from the object to the detector. For small angles the 
polarization factor (1 + cos^20)/2 is approximately one leaving [1]: 



(12) 
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The Momentum Transfer vector 

We will assume the amplitude and intensity of all secondary 
waves to be one for this discussion. With this framework, each 
secondary wave is represented by the complex function e'^ where (p is 
the phase. Because the amplitude and intensity are one, all waves 
differ only by their phase. The phase of the scattered wave depends 
on the position of the oscillating electrons in space. 
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The Scattering Intensity Curve can be derived from the 
Electron Density Function 

The term electron density is frequently used in the literature in 
the place of electron density difference or contrast. The electron 
density p is the number of electrons per unit volume. In SAXS 
experiments only the electron density difference pz — pi (pz is the 
electron density of the sample, pi is the electron density of the 
solvent) is measurable. If pz = pi, then scattering is not observed 
because the waves scattered in any direction will cancel out. During a 
SAXS experiment the electron density of the buffer solution is 
subtracted from the density of the combined sample and buffer 
solution leaving the electron density of the sample without 
background solution. 

The electron density function p(r) is defined in real space for 
non-negative values. It is a histogram of equivalent pairwise atomic 
distances in a given sample. Because of the solution subtraction, the 
electron density it is zero everywhere except for defined electron 
distances in the sample where identical distances add together. 



Figure 2. X-Ray Scattering: Adapted from Small Angle X-ray Scattering [1]. 
Incident (so) and Scattered X-rays (s) with the derivation of the momentum 
transfer vector q. 



The phase of the secondary waves is Ztt/A. multiplied by the path 
difference between the scattered and incident waves. In the diagram, 
we let so represent the direction of the incident beam and we let s 
represent the direction of the scattered beam. The path difference of 
a point P, specified by r, against the origin O is: -r(s-So). The phase is 
given by [1]: 



= - ^r(5 - 5o) ; (p = -qr 



(13) 



The term (s-So) is symmetric to the incident and scattered beam 
with magnitude of 2sin9. In this representation 0 represents half the 
scattering angle. The momentum transfer vector q is independent of 
the distance to the detector and the wavelength (A) and defines the 
scattering curve in reciprocal space with units of A'^ The momentum 
transfer vector has the same direction as (s-So) and the magnitude is 
given by substituting 2sin0 for (s-So): 



l<7l 



47r-sin(0) 



(14) 



where 29 is the scattering angle. We refer to q as the magnitude of the 
momentum transfer vector q. In the literature, this term has been 
defined multiple ways and one must be aware of the convention used. 
For example the symbols h and s have been used in place of q. 
Sometimes s is defined as s = (2sin0)/A. with q = 27rs. Others define 
0 rather than 20 as the scattering angle. In this review we use the 
convention for q shown above with 20 as the scattering angle. Large 
interatomic distances contribute primarily to the scattered X-ray 
intensity at small scattering angles, whereas short interatomic distances 
primarily contribute to X-ray intensity at large scattering angles. The 
information content of a SAXS profile is small compared to other 
high resolution experimental techniques because the overall scattering 
profile represents the orientationally averaged contribution of all 
atoms in all orientations. The SAXS scattering curve contains 
information related to the overall shape of the molecule and is 
routinely used for the validation of structural models [15,16]. 



I 



I 
I 

I I I I 
I I I I 




0 1 2 3 4 5 

r (distance) 

Figure 3. The pairwise distance distribution function adapted from X-ray 
solution scattering (SAXS) combined with crystallography and 
computation: defining accurate macromolecular structures, 
conformations and assemblies in solution [3]. Pair-wise distances 
between each atom are represented. The distances are symmetric and 
are represented twice by the double arrows. The P(r) function will be zero 
whenever a particular distance is not defined by the geometry of the 
sample. 



If we have the distance distribution function then the scattering 
curve I(q) can be calculated by Fourier inversion[l]: 



sin(qr) 
qr 



dr 



(15) 



Likewise the distance distribution function p(r) can be calculated by 
Fourier inversion of the scattering curve [1]: 



(16) 



Theoretical scattering curves can be computed for a model of a 
given shape and compared with experimental data using either the 
intensity calculation I(q) or the distance distribution function p(r). 
The distance distribution function allows the deduction of the largest 
particle dimension dmax and is the distance at which the p(r) drops to 
zero. 
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Figure 4. Originally from SAXS combined with crystallography and computation [3]. This figure depicts the experimental SAXS curves and parameters measured 
for Pyrococcus furiosis PF1282 rubredoxin (magenta), a 'designed' scaffoldin protein S4 (red), a 'designed' minicellulosome containing three catalytic subunits 
(green), and the DNA-dependent protein kinase (blue), (a) Dmax of the scattering particle is a simple function of molecular weight for perfect spheres, but not 
for proteins that adopt different shapes. Envelopes correspond to ab-initio models calculated from experimental curves using GASBOR. (b) The experimental 
scattering curves for each protein show that the intensity of scattering falls more slowly for rebredoxin (Rg 11 A ; magenta) than the minicellulosome (Rg 82 A; 
green), (c) The linear region of the Guinier plot, from which Rg and 1(0) can be derived, is a function of the Rg. (d) Each protein has both a substantially different 
□max as well as pair-distribution function, reflecting the different atomic arrangements. 



Debye Formula for computing scattering profiles from Atomic 
Coordinates 

Proteins are built up from the arrangement of amino acids which 
are built up from the arrangement of atoms differing by side chain 
arrangement. Imagine a protein sample in a fixed orientation. The 
centers of mass of each atom may be designated by ri, rz, . . rn, and 
their amplitudes with respect to each mass center by fi, fz, . . fn. The 
total amplitude is [1]: 

Wm('?)=i:7=i/;('?)-e-''""^- (17) 



where the additional phase factor describes the position of the atom 
and fj(q) is the amplitude. The intensity is the absolute square of the 
amplitude, averaged over all orientations: 

m = fr = iz^iZUfjfk ■ e-"^^'rr>c)) (18) 

When j=k the phase factor reduces to one. This situation represents 
the contribution to the intensity diffracted by the atoms alone. The 
situation j^^k represents the interference between the atoms, according 
to the relative distance (rj-rk). Each amplitude f has a phase: 

fi = \\fj\\-^'''^ (19) 
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Splitting the atomic diffraction (j— k) from the interference between 
atoms (jv^k) yields: 

I(q) = + 2 ■ <i: Y.%k\fj\\fk\ e'('"-i'^+*'=-«'i)> (20) 

In SAXS experiments there is no fixed origin because particles are 
sampled in all orientations. The phase is dependent on a fixed origin. 
By averaging over all orientations and restricting atoms to be 
spherical, the phase vanishes, (cpk - (pj) = 0 and fj becomes independent 
of orientation. Furthermore, spherical averaging of all orientations is 
given by: 



sin(q-r;j) 



(21) 



This representation of the spherical averaging is known as the Debye 
factor [17]. The final Debye formula is: 



(22) 



In this format the amplitudes f are calculated by computing the 
atomic structure factors. The atomic diffraction and interference 
between atom sums can be combined together to give the form of the 
Debye equation frequently cited in the literature: 



sm{q-rij) 



(23) 



where rij = | ti — tj | are the x,y,z positions of atoms i and j. The 
Debye formula given above takes the atomic x,y,z coordinates as input 
and returns the intensity as a function of momentum transfer q. This 
double sum of all atoms in a given system for each computed q value 
has a computational cost of O(N^). The quadratic cost is a 
prohibitive barrier for atomic level application of the Debye formula 
for large systems (N > 10,000). In the case of structural refinement 
for SAXS, the scattering profile must be computed from all pairs of 
interactions with atoms in the molecule. In high-throughput 
applications the profile must be computed thousands of times, while 
in an iterative ensemble analysis, the profile must be computed 
hundreds of thousands of times. Because of the high computational 
cost, different methods have been developed to reduce the number of 
necessary calculations to compute intensity. Before we discuss the 
approximations to the Debye formula, we must first understand the 
structure factors f (q) and fj(q). 

Structure Factors and Form Factors 

The atomic form factor is a fundamental physical quantity in 
solid state physics. It is the Fourier transform of an electron 
distribution around a nucleus of a given atom and carries information 
on the electron wave function. The X-ray scattering power of a given 
atom will depend on the number of electrons it contains. As the 
number of electrons contained in an atom increases (higher atomic 
number), the scattering power increases. As the scattering angle 
increases, the scattering power decreases. A scattering angle of zero 
results in the maximum scattering factor for a particular atom which is 
equal to Z — the atomic number. The form factor approximations are 
based on the combination of relativistic Dirac-Slater wave functions 
and numerical Hartree-Fock wave functions [18-21]. These Hartree- 
Fock structure factors were computed from q = 0 to q = 1.5 at 
intervals of 0.0 lA'^ For convenience, they were fit to a 5-gaussian 
(Cromer-Mann) analytic function: 



(24) 



where fv,i (q) is the structure factor of a particular atom at a given q- 
value in vacuo. The constants ai, az, as, m, bi, hi, bs, b4, and c are the 
Cromer-Mann coefficients for a given atom, and q is the momentum 
transfer in inverse angstroms. Tables for the Cromer-Mann 
coefficients are found in the International Tables for X-Ray 
Crystallography [22]. This approximation is valid in the q-ranges for 
SAXS scattering experiments from 0 to ~ 0.33A"^ [2,3]. For larger q- 
ranges, a 6-gaussian approximation must be used which is valid from 
Oto~ 6.OA-I [21]. 

In addition to the vacuo contribution to the form factors, the 
solvent makes a critical contribution to the overall scattered intensity. 
The solvent effect is considered by modeling the solvent as an 
electron gas with density equal to the average electron density of the 
solvent[23]. Taking the solvent effect into account, the overall 
structure factor of the atom is the combination of the structure factor 
representing the excluded solvent subtracted from the form factor for 
a given atom: 



Fi(q) = fv,i(q) - fs,i(q) 



(25) 



where fs,i is the structure factor of the hypothetical atom that 
represents the displaced solvent. The displaced solvent scattering term 
fs.i is given by: 



fs,i(q) = PVje 4n 



(26) 



where p is the electron density of the solvent. For pure water this is 
0.3 34e A''^. Vi is the solvent volume V displaced by atom i and is 
calculated from the van der Waals radius of the atom. [23,24]. The 
exponential term is the normalized Fourier transform of the Gaussian 
sphere. This sphere corresponds to the excluded volume around the 
atom. 

The electron density surrounding the scattering body is calculated 
by computing the number of electrons per liter of solvent and then 
converting that to the number of electrons in a cubic angstrom. This 
excess electron density is then added to the density of pure water. 
Proteins have an electron density around 0.44e A'^[2]. The electron 
density of the solvent should maximize difference between itself and 
the electron density of the sample to maximize contrast in SAXS 
experiments. The derivation for the electron density of pure water 
with a density of 1 g/ mL is shown below: 

6.02 * IO23//2O Mo/ecu/es] r electrons ^\lmolH20^ \ Ig H2O i\lcm^ H2O] 



1 mol HyO 



IHyO Molecule. 



1 rl mol H2O1 r Ig H2O i 
J [ 18^ J [icm^ //20J [' 



1024 13 



'■ 0.334e A" 



Now that we have reviewed the theory of X-ray scattering and 
have an idea of the Debye equation with a costly double sum over all 
atoms, we are ready to review methods using the Debye equation 
designed to maximize accuracy while minimizing computation time. 

Fast approximation of the Debye Formula by Pantos and 
Bordas 

In 1994, Pantos and Bordas used an approach to simulate SAXS 
patterns of large molecules by building models of closely packed 
spheres that are much larger than individual atoms thereby reducing 
N for the calculation. This was incorporated into the software 
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package DALAL They used the Debye formula to compute an 
intensity profile of the proposed model [25]: 

Kq) = lU'j(q) + 2Y!l=,n=iFt{q)F^iqY^^^ ,j * k 

(27) 

The first sum gives the intensity for spheres in isolation, while the 
double sum give the contributions from density-density correlations. 
To reduce the computational task in the double summations of the 
Debye equation, all spheres were given the same radius and mass 
density. The structure factor product Fi(q)Fj(q) is now constant for 
each value of q and can be pulled out of the double sum. The Debye 
formula becomes: 

(28) 

At this point in the formulation. Pantos and Bordas have not 
compromised the accuracy of the calculation for the reduced sphere 
model. They moved the bulk of the computation to the initial state of 
the algorithm. The calculation of rij is still O(N^). To model large 
structures requiring a large number of spheres, they approximated 
pairwise distances between atoms. In this approach pair distances are 
grouped into a histogram of bin sizes based on the experimental data 
resolution. Without binning, the number of pairwise distance terms is 
equal to N(N-l)/2. In this method the distances were quantized to 
multiples of dmax/lOO where dmax is the maximum particle dimension. 
The resolution increases with decreasing bin size and decreases with 
increasing bin size. The resolution adjustment blurs the sampling grid 
by an undetectable amount in the resolution range of the simulation. 
The pair distance matrix of rjk values are now a vector of distances 
weighted by the number of distances occurring in a given bin. The 
scattering formula becomes: 

Kq) = Ijiq) + 2 F\q) I.^,Tm(r^) (29) 

where m(rk) is the bin population at pair distance rk and the limits of 
the sum are the number of distance bins. 

This method is valid when protein structures are modeled with 
multiple spheres of constant radii and mass density. When this 
condition is met, the structure factor calculation can be brought out 
of the double sum. The Debye calculation can then be binned leading 
to change of an O(N^) calculation to 0(N). Prior to this calculation 
the pairwise distances must be pre-computed and binned which is still 
an O(N^) calculation. The speed increase by this algorithm is 
dependent on the number of spheres used to model the system. An 
advantage of this method is that the pairwise distance matrix must 
only be computed once and can then be reused during the course of 
analysis. 

Calculation of SAXS profiles with the Debye formula from 
coarse-grained protein models 

In 2010, Stovgaard et al, used the Debye formula for calculating 
the scattering curve combined with a coarse-grained representation of 
protein structure to address the high computational cost [26]. This 
approach led to a significant speed-up in computational time when 
compared with the all atom calculation. In this approximation, amino 
acids were represented by two scattering bodies or dummy atoms — 
one representing the backbone, and the other representing the side 
chain. These dummy atoms were placed at the respective center of 
mass of the atomic group they represented. They had to estimate 2 1 



form factor values for this approximation — one for alanine, one for 
glycine, one for the backbone, and 18 for the remaining side chains. 
They recreated these functions for each of the 21 form factors by 
binning the q-range into intervals of equal width (0.015 A'^) and then 
computing a form factor estimate for each of the 2 1 form factor types 
in each of the q-bins. They sampled form factor values from a 
training set of 297 structures with lengths between 50 and 400 
residues and calculated a form factor estimate from the centroid in 
each bin. The SAXS curves generated through the Debye formula 
with dummy atom form factors for 50 proteins were compared with 
SAXS curves generated for the same proteins through CRYSOL with 
high agreement. 

This method is contingent upon the accuracy of the form factor 
estimates for the dummy atoms and relies on a training set of 297 
proteins to represent amino acids in nature. Amino acid residues 
behave differently in different environments, and caution must be 
used to ensure the training set accurately represents the environment 
of the protein of interest. The authors state that two additional 
developments with this method are needed: 1) a proper description of 
the hydration layer and 2) a probabilistic description of the 
experimental errors associated with a SAXS experiment. This is 
currently under development in the PHAISTOS software package. 

The incorporation of the hydration layer into the Debye 
Formula via the form factor equations 

In the same year that PHAISTOS was published, the Sali Lab 
published their approach to the Debye formula and made their web 
server FoXs publically available [27]. To account for the displaced 
solvent and hydration shell, the structure factor contribution for a 
given atom is given by: 

Fi(q) = fv,i(q) - Ci fs,i(q) + C2Sif„,i(q) (30) 

where fv,i (q) is the form factor of a particular atom at a given q-value 
without the effects of excluded volume and a water shell, and fs,i is the 
structure factor for the excluded volume, and the last term is the 
structure factor of the hypothetical molecule that represents the 
displaced solvent. Si is the solvent accessible surface area for a given 
heavy atom and fw,i is the form factor of water. This approach is 
novel because it models the hydration shell as a function of the 
solvent accessible surface area of a given atom. The parameter ci is 
used to adjust the electron density contrast while the parameter C2 is 
used to adjust the hydration shell thickness. The structure factor of 
water is given by the sum of all atomic form factors in water: 

/w,i(^) ~ 2 ^ fv,i(^R^ hydrogen ~^ fv,i(.R^ oxygen (31) 

The computed profile was fit to a given experimental SAXS profile 
by minimizing the chi function with respect to c, C\, and ci: 

^ = J^2f^.(=^=^)' o^) 

where lexp(q) and I(q) are the experimental and computed profiles, 
a(q) is the experimental error of the measured profiles, M is the 
number of points in the profile, and c is the scale factor. The 
minimum value of chi was found by a computing c\ on the interval of 
[0.95, 1.12] and ci on the interval of [0, 4.0] in steps of 0.005 and 
0.1. Linear least squares minimization was performed to find the 
value of c that minimized chi for each c\ and C2 combination. 
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Similar to DALAI, FoXs has the structure factor calculation 
moved out of the double sum of the Debye formula. Instead of 
modeling uniform space filling spheres, they assumed an identical 
modulation of f (q) for different atoms i: 



fi(q) = fi(0) • E(q) 



(33) 



where the modulation function E(q) is equal for all atoms. This 
approximation creates a system of different scattering masses but 
equal shape. The pairwise distance distribution function represents 
population at a given distance r and is given in this approximation as: 



(34) 



where (5(r-dij) is the Dirac-Delta distribution and r is a given pairwise 
distance. In this representation, only the form factor with a constant q 
= 0 is considered, which reduces the value to the atomic number Z of 
the given value. The intensity is given by: 



qr 



(35) 



The modulation function E^(q) is parameterized as: 

E2(q) = e^-*""') (36) 

The parameter b was determined by computing the SAXS profile 
with the original Debye formula using the non-approximated form 
factors and then computing the SAXS profile with the approximated 
form factors and initial guess of the b parameter. The parameter 
b=0.23lb0.01 A'^ was chosen to minimize the difference between 
both profiles from 30 random protein structures from the Protein 
Data Bank. This approximation typically speeds to calculation of the 
Debye formula by two orders of magnitude. 

The explicit incorporation of the hydration layer into the 
Debye Formula 

In 20 11 , the Zhang lab at the University of Michigan introduced 
SAXSTER, an online tool to improve protein template recognition 
by using SAXS profiles [28]. In their approach they also simulate the 
SAXS intensity profile according to the Debye equation. Instead of 
summing over all atoms, they sum over all atoms plus the explicit 
water atoms. The equation is: 



sm(q-rij) 



(37) 



where W is the number of "dummy*' water molecules around the 
protein representing the hydration shell. The initial structure factor 
equations are identical to equations previously shown. To account for 
the explicit water molecules around the model, they started from a 
face-centered cubic (FCC) lattice system with edge length Lceii. Each 
point in the lattice represents a water molecule. The overall structure 
factor is given by subtracting the excluded solvent from the atomic 
form factor and adding the explicit water contribution from the 
lattice. The protein structure is projected onto the FCC system and 
only water molecules in the range of 3.5-6.5 A to any Ca atoms are 
kept. The density of the water molecules in the lattice system is 
defined by: 



Pfcc 



^FCC , 
VpCC 



4k^ 
1? 



(38) 



where N is the number of points in the FCC lattice system, V is the 
volume of the system, k is the number of unit cells in the x,y,z 
directions and L = k Lceii. L represents the maximum length for each 
direction. In a FCC lattice system, the water contribution from each 
corner of the cubic cell is 1 /8 and the contribution from each face is 
1/2. There are eight corners and six sides yielding an effective water 
contribution of four (8(1/8) + 6(1/2)). Each water molecule 
consists of 10 electrons yielding 40 (water contribution of four 10 
electrons) electrons per cubic cell. The number of excess electrons per 
volume in the hydration shell relative to the bulk water is: 



^ _ 40 electrons _ 

— 73 — Pshell ~ Pbulk 

^cell 



(39) 



The thickness of the hydration shell is thus controlled by the edge 
Length of the FCC system. The threading-based models are 
composed of a-carbons only and the SAXS computations are given 
by: 



sm{qrij) 



(40) 



p(r) = Eilr 2:7=T n^fiq = 0)4^(<? = O) 5 (r - ry) (41) 

This form of the p(r) function is very similar to FoXs. The 
difference is that the water molecules are explicitly summed over. In 
the approximation, a new structure factor must be derived to represent 
the a-carbons: 



)l/2 



(42) 



where (...) denotes the average over all residues of the same type 
calculated from 200 randomly selected PDB structures. The term f(q) 
is computed by the initial structure factor equations previously shown. 
This procedure produces 20 effective structure factors for each amino 
acid type. In the case of water, its scattering factor is calculated by the 
modified Debye equation with n = 3, tij = 0 and Fi(q) being the 
vacuum form factors for either hydrogen or oxygen. 

Spherical Harmonics - A second widely used approach to 
address the computational cost of SAXS profile reconstruction 

In the methods previously described, the orientational averaging 
of the scattered waves was computed analytically using the Debye 
relation [17]: 



sin(q-r) 
q-r 



(43) 



Instead of analytically computing the orientational averaging, 
another method is to use a mathematical representation of the 
scattering body (or protein) that uses the rotational properties of 
spherical tensors. In this formulation the scattering body is expanded 
in terms of an infinite series of spherical harmonics. The 
orthogonality properties of the basis functions simplify the averaging 
of the harmonic series from which an overall scattering intensity can 
be computed. These basis functions are built from spherical Bessel 
functions, and normalized spherical harmonics of degree m and order 
L. This approach reduces the computational complexity from O(N^) 
to 0(N). 
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The scattering amplitude in vacuo of a particle with N atoms is: 

A„{q) = i:UfMy''"'> (44) 

where rj = (rj,COj) = (rj,0j,(pj) and fj is the corresponding atomic form 
factors. Spherical averaging is simplified by multipole expansion [29]: 

e = An tiTo' S^^-L i'ii {qr)Yl^ (t^)^^ (") (45) 

where jL(qr) are the spherical Bessel functions of order L and YLm(Q) 
are the spherical harmonics of order (L,m). The angular symmetry of 
Ylih is related to the symmetry of the multipoles: L=0 (monopole) 
L=l (dipole), L=2(quadrupole), etc. Substituting the multipole 
expansion with spherical harmonics for the exponential term yields: 

^vacuo (si) ~ 

TL=-L ^ni%rn{^) SjLi A(qr,)>7m(^^;) (46) 

where (rj,COj) are the polar coordinates of the jth atom. The partial 
amplitudes can be separated from the proceeding equation: 

Avacuoiq) = i:L=r^m=-LALm(q)Y,^(^) (47) 

where ALm(q) are the partial amplitudes and are given by: 

AimiQ) = 4ni^ fj(q)jL(qrj)Y,*^(coj) (48) 

Because of the orthogonality properties of spherical harmonics, the 
cross terms cancel and the intensity calculation is reduced to [30]: 

Ivacuoiq) = j\A,miqWdS(u) =I.tS''^m=-L\ALm(q)\' 

(49) 

The huge advantage of spherical harmonics is that the complexity 
is reduced from O(N^) to 0(N). The integrand for averaging over 
the sphere in the proceeding equation is approximated by an L = 
O(qD) band limited function in a spherical harmonic basis where q 
is the momentum transfer vector and D is the maximum dimension of 
the sample. It is insufficient to use L smaller than qD / 2 because any 
value less than this violate Nyquist Shannon sampling[31] for 
periodic functions. At least = O(q^D^) sampling points are needed 
to provide an accurate integration of bandwidth L. Any index above 
L does not improve the fit for a given qmax, while any index below L 
will result in systematic errors in the calculation[32]. 

CRYSOL — The incorporation of the hydration shell using 
spherical harmonics with multipole expansion to compute 
SAXS profiles from atomic coordinates 

By the early 1990s there were many studies showing the 
importance of modeling the water molecules surrounding a given 
macromolecule when recreating SAXS profiles from atomic 
coordinates. For example, Grossman et. al compared experimental 
SAXS profiles with SAXS profiles computed from different 
configurations of dimers, trimers, and tetramers. They optimized the 
agreement between experimental and simulated scattering profiles by 
placing solvent molecules on a diamond-shaped grid surrounding the 
structure[33]. In their results, the computed SAXS profile with the 
best fit to the experimental SAXS profile consisted of a solvent shell 
of 716 water oxygens up to a maximum distance of 3.15 A from the 



protein surface. Their results suggested that the water shell very close 
to the surface of a protein differs in electron density from the 
remaining bulk water and thus contributes to x-ray scattering. 



0 1 2 3 4 5 




Figure 5. Originally from Models, structures, interactions and scattering 
[2]. Accuracy shape representations using spherical harmonics. Top row: 
surface representations of truncated envelope functions of lysozyme. 
Second row: high-resolution envelope functions and Ca trace of the 
protein. The shape scattering intensity from lysozyme is shown along with 
the contributions from different multipoles. 

In 1995 Svergun et. al released CRYSOL — a program to 
compute SAXS intensity profiles from atomic coordinates while 
considering the hydration shell surrounding the target sample [24]. 
There were lingering questions concerning the true cause of the 
electron density contrast conditions surrounding a sample in solution. 
Was the density contrast caused by a water layer or could the contrast 
be explained by side chains moving freely on the protein surface? 
Three years later in 1998 Svergun et al. confirmed in a combined X- 
ray and neutron scattering study that the differing electron contrast 
conditions were more likely caused by a denser hydration shell rather 
than a higher mobility of the side-chains on the protein surface[34]. 
Water modeling is critical to the correct interpretation of SAXS 
profiles and computational methods are under development today to 
improve chemistry constraints, improve geometric constraints (surface 
curvature), and incorporate experimental data from high-angle 
SAXS[35]. 

Currently, popular approaches for modeling the hydration shell 
are to: 1) place water molecules on the surface of the protein, 2) 
simulate the solvation shell by surrounding the protein with a 
continuous outer envelope, 3) simulate the solvation shell and 
excluded volume by computing a modified scattering factor. 

CRYSOL employed the second approach to model the hydration 
shell and extended the multipole expansion and spherical harmonics 
formulation to handle not only the vacuo scattering, but the excluded 
volume and hydration shell. [9] 



Volume No: 8, Issue: II, e20 1 308006 



Computational and Structural Biotechnology Journal | www.csbj.org 



RecDnstructinn of SAXS Prnfiles frnm Prntein Structures 



In this formulation the intensity is given by: 

Kq) = {\Aa(q) - PoAc(q) + SpA,iq)\^)n (50) 

where Aa(q) is the in vacuo scattering, Ac(q) is the excluded volume 
scattering and Ab(q) is the border layer scattering, 6p = pb — po, 
where po is the average scattering density of the solvent surrounding 
the particle and pb is the average scattering density of the border layer 
around the particle with thickness A. ( stands for the average over 
all particle orientations and Q is the solid angle in reciprocal space, q 
= (q, Q). Each of the three amplitudes is represented via its multipole 
components. Because of the orthogonal properties of the spherical 
harmonics, all cross terms cancel in the average over Q, leading to: 

i(q) = 'Zi=o'Z'm=-Mim(q) - PoCimiq) + SpBi^(q)\^ 

(51) 

The value L defines the resolution of the particle. This approach 
works best with shapes that can be described using spherical 
harmonics which include most globular and extended proteins. 
Spherical harmonics is less adept at handling shapes that contain 
internal cavities such as shells and do nuts. [24] Additionally this 
method uses by default a harmonic order of 15, with a maximum 
value of 50. This gives the method a complexity of 0(MN) with 
M=q^D^. This can lead to errors when a harmonic order greater than 
50 is necessary based on the size of the protein and desired qmax. 

In CRYSOL there are several adjustable parameters used when 
calculating predicted data that best match the experimental curve. 
These parameters are: the effective atomic radii multiplier which scales 
the solvent volume displaced by each atom (vi), the electron density 
contrast of the surface solvent layer (cz) and the total displaced 
solvent volume (ci), approximately equal to the variation of the 
electron density of the displaced solvent relative to bulk water. The 
need for adjustable parameters in CRYSOL becomes clear when 
studying SAXS profile reproducibility for distinct samples of the 
same protein on different instruments. The characteristic features of 
the experimental scattering profiles are conserved between 
experiments, but the experimental variation of the scattered intensity 
at higher q- values depends on the extrapolated intensity at 1(0) [3 6]. 
Because of the beamstop in a SAXS experiment, 1(0) cannot be 
directly observed. One method to extrapolate this value is to compute 
the slope of the intensity profile in the initial linear region of the 
scattering profile (the Guinier region) and extrapolate to the y- 
intercept. The adjustable parameters in CRYSOL absorb this 
variability by changing the level of the higher-q features of the 
predicted data relative to the low-q intensities. 

Extension of CRYSOL to improve accuracy 

Fifteen years after the introduction of the original CRYSOL 
program, Alexander Grishaev, Liang Guo, Thomas Irving and Ad Bax 
introduced AXES in 2010 — a program for fitting SAXS data to 
macromolecular structure and ensembles of structures [3 6]. The 
program AXES was designed to be more discriminating than 
CRYSOL when evaluating poorly or incorrectly modeled protein 
structures. On a set of small well-studied proteins that had X-ray 
crystallography and solution NMR data they reported an 
improvement in fit by 10-50% by score. This set was comprised of 
four proteins — hen egg white lysozyme, cytochrome c, the B3 domain 
of protein G (GB3) and ubiquitin. 



They reformulated the approach to fitting SAXS data by 
explicitly taking into account the sources of experimental data 
variability: 

4xp(^) — hampleiR) ~ ^huffer + ^ (52) 

where a accounts for the uncertainty in the measurements and c 
accounts for the variability of the detector and X-ray fluorescence. 
These uncertainties appear responsible for the systematic difference 
between repeated experimental data sets. Taking these uncertainties 
into account, the computed scattering intensity is: 

Kq) = {{{\Aa(q) - PoAciq) + SpAh(q)\^}n}soiv)ens 

(53) 

where Q is the average taken over a discrete set of molecular 
orientations relative to the incident beam, solv is the average taken 
over the displaced and surface water sets, and ens is the average over 
the ensemble of macromolecular structures. The program AXES 
models the hydration shell directly by using explicit water molecules 
from a pre-equilibrated water box. 

Using this approach they tested how well they could discriminate 
different models of the same protein. They generated 2000 models 
of GB3 using Rosetta and fit the experimental SAXS data to all of the 
models usmg both CRYSOL and AXES. The CRYSOL fits yielded 
5( values that were much lower for poor models (models with a high 
RMSD relative to the native structure) than the native structure. This 
behavior is indicative of overfitting. Using AXES, they did not 
observe significantly better fits for the poor Rosetta models. 
Furthermore, when provided chemical shift guided Rosetta models 
with the correct fold, AXES correctly assigned higher values to non- 
native structures. 

The cost of this higher precision comes at the price of 
computation time. AXES is more than an order of magnitude slower 
than CRYSOL due to the averaging of the scattering amplitudes of 
the displaced and surface solvent sets over 20 different configurations. 
Among these configurations are: 6 elementary scattering functions 
averaged over angular orientations, macromolecular conformers, and 
molecular solvent configurations for a given electron density contrast 
of the surface solvent layer. Currently several avenues for computation 
speedup are under development. 

The use of Zernike polynomials to compute SAXS scattering 
profiles 

We previously mentioned three popular approaches for treating 
the hydration shell and excluded solvent. They were: 1 ) to place water 
molecules on the surface of the protein and compute scattering 
profiles with explicit water molecules, 2) simulate the solvation shell 
by surrounding the protein with a continuous outer envelope, 3) 
simulate the solvation shell and excluded volume by computing a 
modified scattering factor. The drawback to the first approach is the 
computational cost to construct the explicit solvent model. The 
drawback of the second approach occurs for proteins containing 
cavities. Assuming a uniform layer around a cavity or hole will 
introduce artificial areas without any electron density. The drawback 
of the third approach is the appearance of non-uniformities in the 
electron density by overlapping dummy atoms. 

In 2012, Liu et al proposed a new method to address the 
limitations of excluded solvent and hydration shell modeling [30]. In 
their approach they parameterized the Fourier transform of the 
electron density distribution function p(r) by a Zernike polynomial 



Volume No: 8, Issue: I I , e20 1 308006 



Computational and Structural Biotechnology Journal | www.csbj.org 



RecDnstructinn of SAXS Prnfiles frnm Protein Structures 



expansion with spherical harmonics. Zernike polynomials are 
orthogonal functions on the unit ball. They reformulated the SAXS 
intensity calculation as: 



jniQ''^max)+ Jn+ziQ'^max) 
2n+3 



where jn is the spherical Bessel function of order n. 



(54) 



(55) 



(56) 



where Cnim is the Zernike moments from three-dimensional objects and 
knn'i is either a positive or negative coefficient given by: 



n+n' 



(57) 



The Zernike moments are computed by a linear combination of the 
geometric moments of the object: 



(58) 



where Mrst is the geometric moment and Xnlm^^^ coefficients. The 
procedure to compute the coefficients are given by the Novotni and 
Klein algorithm [37]. 



(59) 



The geometric moments are computed from a scattering object 
that has been segmented into a series of small volume cubes called 
voxels. Voxels are used in 3D graphics for the visualization and 
analysis of medical and scientific data. In this case the voxelization 
process maps electron density from the scatterer (or protein) into 
voxels from which the geometric moments can be computed. From 
this process, multiple sets of voxels are created: 1 ) P — the set of non- 
zero electron density voxels, 2) S+B — the set of voxels representing 
the excluded solvent and surface bound solvent, and 3) S — the set of 
voxels representing the excluded solvent. 

The Zernike moments of all three voxelized objects are combined 
by a weighted sum to produce one set of Zernike moments from 
which the scattering intensity is computed. The computational 
complexity of this algorithm is 0(N), but prior to computation, the 
voxelized object must be created in a preprocessing step. 

The advantage of the Zernike expansion method is that it can 
model holes or cavities of structures that spherical harmonics 
traditionally has difficulty with. This approach also incorporates all 
solvent-accessible surfaces into the overall scattering profile. When 
compared on a set of ten experimental proteins with high resolution 
crystal structures, this method had similar results with the spherical 
harmonic expansion method. This method offers an extension to 
spherical harmonic expansion methods that may improve the fit to 
experimental data by improved hydration shell and excluded volume 
treatment of structures with cavities or holes. It is included in the 
SASTBX software package. 



Summary of Techniques 



Year 



Method 



Complexity 
BigO M 



1994 



1995 



2010 



DALAI ( Debye with binned pairwise distance) 

N Nbins 

.(q) = 2.Kq) + 2F(q)|;m(r,)?'"(''-'-^^ 

j=l k=l 



CRYSOL (Multipole expansion and spherical harmonics) 

Lmax L N ^ 

L=0 m=-L j=l 

PHAISTOS (Debye with Bayesian modehng of form factor) 
.(q) = tt«q)fKq)^^ 

1=1 ]=i ' 

FOXS (Debye with approximated structure factor) 



N N 



2010 i(q) = 2,2, 



O(N') 



0(MN)[35] 



0(N2)f2 



(q2j)2)[32] 

M: number of atoms in the structure 

K: number of atoms described by a 
dummy body. Kave = 4.24 



2010 



AXES (multiple averaging with spherical harmonics and explicit water molecules) 
«<|Aa(q)-PoAc(q) + 6pAb(q)|^> 

fl)solv)ens 



□(MN)^"^^^ M: number of spherical grid points 



2011 



2012 



SAXSTER (Debye with explicit water molecules) 

N+WN+W . . 

. sin(q • rjjj 



2, Fi(q)Fj(q)- 



i=i j=i 



q-Fj 



1] 

SASTBX (3D Zernicke polynomials) 

rimax n 1 

2] 2] Z i'(-l)^""'^'''^nlmYlm(Wc,)b„(q) 

n=0 1=0 m=-l 



0([N+W]') 



0(MN)[35] 



(N:nax+1)^ 
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Recent developments for SAXS profile reconstruction using 
GPU acceleration 

In 2012, the SAXS algorithm in PHAISTOS was accelerated 
using general purpose graphical processing units (GPGPUs)[38]. 
This method utilizes Bayesian probability statistics to compute the 
form factors in the Debye equation for protein models built from 
either one or two scattering bodies. The speed up using CPU's was 
measured from protein sizes ranging from 64 to 8192 scattering 
bodies. They reported a 1 6x speed up for proteins with 64 scattering 
bodies. As the proteins increased in size the speed up increased to a 
maximum speed up of 394x for proteins with 8192 scattering bodies. 

Because of the uncertainty introduced into the accuracy of the 
Debye equation by approximation methods, we devised a method to 
compute the intensity directly without approximating structure factor 
calculations (unpublished). Furthermore, we model the hydration 
shell as a function of the solvent accessible surface area of a given 
atom analogous to FoXs. Our method BCL::SAXS offsets the high 
computational cost of the Debye formula by simultaneously 
computing multiple pieces of the equation using the parallel 
architecture of graphical processing units (CPUs). The Debye 
formula can be framed as an NxN square matrix of N-atom rows by 
N-atom columns where N is the number of atoms in a given protein. 
The pairwise Euclidean distances (rij) are calculated from the upper 
triangle of the matrix. The diagonal is set to zero and the lower 
triangle is a symmetric mirror of the upper triangle. Each GPU 
thread computes a partial Debye sum. 

Ivartiaiiq) = 111 FiiqW^l) ^^7^ (^0) 

This results into a matrix of q rows by N-atom columns where q is 
the momentum transfer and N is the total number of atoms. These 
partial values are summed across each column to complete the 
intensity computation: 

hotal(.R^ — Tif=ilpartial,n (61) 

This approach removes the uncertainty introduced by structure 
factor approximation while maintaining the efficiency of methods 
with structure factor approximations. The speed up using CPU's was 
measured from protein sizes ranging from 1832 atoms (PDB ID: 
1026) atoms to 91,846 (PDB ID: IVSZ). Usmg a GTX680 GPU 
card, we observed a 5x speed up for the smaller protein (1026). For 
the largest protein in our set (IVSZ) we observed a speed up of 
1707x for protein IVSZ using the same graphics card. By leveraging 
GPU's, we absorb the O(N^) cost while achieving substantial 
reduction in computation time without sacrificing accuracy by 
introducing approximations to the Debye formula. 

Conclusion 

In this review we focused on proteins as a scattering body, but 
RNA and DNA can be studied as well using SAXS. These algorithms 
represent a sampling of methods for SAXS profile reconstruction and 
are not representative of all the approaches that exist. Another 
approach that expands these ideas was published in 2012. In this 
work, Gumerov et. al proposed a Hierarchal algorithm based on a fast 
multipole method (FMM) to compute SAXS profiles [32]. For a 
review of timing and accuracy for protein of varying sizes and shapes 
with either spherical harmonic or Debye implementations we refer to 
their work. In each of the algorithms presented, there was a trade-off 



between speed and accuracy. In order to use the Debye formula for 
protein structure analysis, approximations were made to the equation 
to move terms out of the double sum. The uncertainty introduced by 
this approach is a subject for further study. In order to model with 
spherical harmonics, the correct harmonic order must be set and the 
shape complexity of the scattering body must be considered. We 
expect that more algorithms in the near future will take advantage of 
the parallelizable form of the Debye equation and use GPU 
acceleration to obtain the necessary computational speed without the 
uncertainty introduced by structure factor approximation and 
momentum transfer binning. 

Furthermore, to standardize testing of SAXS algorithms we echo 
the suggestion of Rambo and Tainer and believe a reference dataset 
should be created with experimental SAXS profiles and PDB 
models [3 5]. This dataset would be comprised of proteins of varying 
sizes and shapes and folds. All new and existing methods should be 
benchmarked against this set to identify strengths and weakness of 
any given algorithm. 
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